<img src="pic/collaborativeFiltering.jpg" width="700" height="900">
The data set contains information about users, their gender and age, and which artists they have listened to on Last.FM. Here we use only the data for Germany, transformed into a frequency matrix with one row per user and one column per artist.
We will use this to build two types of collaborative filtering:
Item Based: uses similarities between the items' consumption histories.
User Based: considers similarities between user consumption histories together with item similarities.
In [5]:
import pandas as pd
from scipy.spatial.distance import cosine
# The data was already downloaded.
data = pd.read_csv('data/lastfm/lastfm-matrix-germany.csv')
# Inspect the first rows of the data set with data.head():
data.head(6).iloc[:, 2:10]
Out[5]:
In [6]:
# In item-based collaborative filtering we do not need the user column.
data_germany = data.drop(columns='user')
In [7]:
#Create a placeholder dataframe listing item vs. item
data_ibs = pd.DataFrame(index=data_germany.columns, columns=data_germany.columns)
Now we can start filling in the similarities. We will use cosine similarity. In Python, the SciPy library has a function for the cosine distance, so no custom implementation is needed.
In essence, the cosine similarity takes the sum product of two columns and divides it by the product of the square roots of the sums of squares of each column.
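As a quick check of the formula, the following sketch computes the cosine similarity of two made-up listening vectors by hand and compares it with SciPy, whose `cosine` function returns a distance (1 minus the similarity). The vectors `a` and `b` are illustrative assumptions and are not taken from the data set.

```python
import numpy as np
from scipy.spatial.distance import cosine

# Two made-up binary listening vectors (illustrative, not from the data set)
a = np.array([1, 0, 1, 1, 0])
b = np.array([1, 1, 0, 1, 0])

# Sum product of the two vectors divided by the product of their Euclidean norms
manual_similarity = np.dot(a, b) / (np.sqrt(np.sum(a ** 2)) * np.sqrt(np.sum(b ** 2)))

# SciPy returns the cosine distance, so the similarity is 1 minus that value
scipy_similarity = 1 - cosine(a, b)

print(manual_similarity, scipy_similarity)
```

Both values should agree at roughly 0.667, since the vectors share two listens and each has a norm of the square root of 3.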
In [8]:
# Fill in the empty cells with cosine similarities
# Loop through the columns
for i in range(len(data_ibs.columns)):
    # Loop through the columns again for each column
    for j in range(len(data_ibs.columns)):
        # SciPy's cosine() is a distance, so 1 - distance gives the similarity
        data_ibs.iloc[i, j] = 1 - cosine(data_germany.iloc[:, i], data_germany.iloc[:, j])
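The double loop computes every pair of columns separately, which can be slow when there are many items. One possible alternative, sketched here on a small made-up matrix rather than the Last.FM data, scales each column to unit length and takes a single matrix product. The names `toy` and `toy_ibs` are placeholders, and this is a different route to the same numbers, not the approach used in the cell above.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cosine

# Made-up user x artist matrix (rows are users, columns are artists)
toy = pd.DataFrame(
    [[1, 0, 1], [1, 1, 0], [0, 1, 1], [1, 1, 1]],
    columns=["artist_a", "artist_b", "artist_c"],
)

# Scale every column to unit Euclidean length
values = toy.to_numpy(dtype=float)
unit_columns = values / np.linalg.norm(values, axis=0)

# The dot product of two unit vectors is their cosine similarity
toy_ibs = pd.DataFrame(unit_columns.T @ unit_columns, index=toy.columns, columns=toy.columns)

# Spot-check one entry against SciPy's cosine distance
check = 1 - cosine(toy["artist_a"], toy["artist_b"])
print(toy_ibs.round(3))
```

This sketch assumes that no column is entirely zero; an artist nobody listened to would have a norm of 0 and would need separate handling.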
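With our similarity matrix filled in, we can find each item's neighbours by looping through ‘data_ibs’, sorting each column in descending order, and taking the names of the top 10 artists. Because every item has a similarity of 1 with itself, the first neighbour of an item is normally the item itself.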
In [10]:
# Create a placeholder for the closest neighbours of each item
data_neighbours = pd.DataFrame(index=data_ibs.columns, columns=range(1, 11))
# Loop through the similarity dataframe and fill in the neighbouring item names
for i in range(len(data_ibs.columns)):
    data_neighbours.iloc[i, :10] = data_ibs.iloc[:, i].sort_values(ascending=False).index[:10]
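To make the neighbour lookup easier to follow, here is a minimal sketch on a made-up 4 x 4 similarity matrix. The names `toy_ibs`, `toy_neighbours` and `top_n` are assumptions for illustration only.

```python
import pandas as pd

# Made-up symmetric item-item similarity matrix
items = ["artist_a", "artist_b", "artist_c", "artist_d"]
toy_ibs = pd.DataFrame(
    [[1.0, 0.8, 0.1, 0.4],
     [0.8, 1.0, 0.3, 0.2],
     [0.1, 0.3, 1.0, 0.6],
     [0.4, 0.2, 0.6, 1.0]],
    index=items, columns=items,
)

top_n = 3
toy_neighbours = pd.DataFrame(index=items, columns=range(1, top_n + 1))
for item in items:
    # Sort the item's column from most to least similar and keep the top names
    ranked = toy_ibs[item].sort_values(ascending=False)
    toy_neighbours.loc[item] = ranked.index[:top_n].to_numpy()

print(toy_neighbours)
```

In this toy example the first column of `toy_neighbours` repeats the item itself, which is why later steps that need true neighbours skip the first position.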
Show the results!
In [21]:
data_neighbours.iloc[:10, :5]
Out[21]:
The process for creating a User Based recommendation system:
Have an Item Based similarity matrix at your disposal (done above).
Check which items the user has already consumed (listened to or purchased); these are not recommended again.
For every other item, compute a score as the weighted average of the user's consumption of similar items.
Recommend the items with the highest scores.
We first need a formula. We calculate the inner product of two vectors (one containing the user's purchase history, the other containing the similarity scores to the current item), then divide that figure by the sum of the similarities in the respective vector.
In [60]:
# Helper function to get similarity scores
def getScore(history, similarities):
    return sum(history * similarities) / sum(similarities)
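A small worked example may help to see what the score means. The sketch below repeats `getScore` so that it runs on its own and applies it to made-up values: the `history` and `similarities` series are illustrative assumptions, not rows from the data.

```python
import pandas as pd

def getScore(history, similarities):
    # Similarity-weighted average of the user's history
    return sum(history * similarities) / sum(similarities)

# Made-up example: the user listened to artist_a and artist_c, but not artist_b
neighbours = ["artist_a", "artist_b", "artist_c"]
history = pd.Series([1, 0, 1], index=neighbours)
# Made-up similarities of those three neighbours to the item being scored
similarities = pd.Series([0.5, 0.3, 0.2], index=neighbours)

score = getScore(history, similarities)
print(score)
```

Here the user has consumed neighbours carrying 0.5 + 0.2 of the total similarity of 1.0, so the score comes out at about 0.7. A user who had consumed all three neighbours would score 1, and a user who had consumed none would score 0.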
The rest is a matter of applying this function to the data frames in the right way. We start by creating a data frame to hold the scores. It has the same shape and headers as our original data, but nothing is filled in except the user column.
In [61]:
# Create a placeholder matrix for the scores, and fill in the user column
data_sims = pd.DataFrame(index=data.index, columns=data.columns)
data_sims.iloc[:, 0] = data.iloc[:, 0]
In [63]:
data_sims.head(3).iloc[:, :10]
Out[63]:
We now loop through the rows and columns, filling in the empty cells with scores.
Note that we score items the user has already consumed as 0, because there is no point in recommending them again.
In [64]:
# Loop through all rows, skip the user column, and fill in the scores
for i in range(len(data_sims.index)):
    for j in range(1, len(data_sims.columns)):
        user = data_sims.index[i]
        product = data_sims.columns[j]
        if data.iloc[i, j] == 1:
            # Already consumed: score 0 so it is not recommended again
            data_sims.iloc[i, j] = 0
        else:
            # Skip position 0 (the item itself) and use the next 9 neighbours
            product_top_names = data_neighbours.loc[product].iloc[1:10]
            product_top_sims = data_ibs.loc[product].sort_values(ascending=False).iloc[1:10]
            user_purchases = data_germany.loc[user, product_top_names]
            data_sims.iloc[i, j] = getScore(user_purchases, product_top_sims)
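The same scoring loop can be traced by hand on a tiny example. The sketch below uses two made-up users and three made-up artists; for brevity it treats all other items as neighbours instead of the top nine, which is a simplifying assumption, and every name in it (`toy_data`, `toy_ibs`, `toy_sims`) is hypothetical.

```python
import pandas as pd

items = ["artist_a", "artist_b", "artist_c"]
# Made-up listening matrix: rows are users, 1 means the user listened to the artist
toy_data = pd.DataFrame([[1, 0, 0], [0, 1, 1]], index=["u1", "u2"], columns=items)
# Made-up item-item similarities
toy_ibs = pd.DataFrame(
    [[1.0, 0.5, 0.2], [0.5, 1.0, 0.4], [0.2, 0.4, 1.0]], index=items, columns=items
)

# Start every score at 0, which is also the value kept for consumed items
toy_sims = pd.DataFrame(0.0, index=toy_data.index, columns=items)
for user in toy_data.index:
    for item in items:
        if toy_data.loc[user, item] == 1:
            continue  # already consumed, keep the score at 0
        # Similarities to all other items, excluding the item itself
        neighbour_sims = toy_ibs.loc[item].drop(item)
        user_history = toy_data.loc[user, neighbour_sims.index]
        toy_sims.loc[user, item] = (user_history * neighbour_sims).sum() / neighbour_sims.sum()

print(toy_sims.round(3))
```

User `u1` has only listened to `artist_a`, so `artist_b` scores 0.5 / 0.9 and `artist_c` scores 0.2 / 0.6. User `u2` has listened to both neighbours of `artist_a`, which therefore gets the maximum score of 1.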
We can now produce a matrix of User Based recommendations as follows:
In [68]:
# Get the top songs
data_recommend = pd.DataFrame(index=data_sims.index, columns=['user','1','2','3','4','5','6'])
data_recommend.iloc[:, 0] = data_sims.iloc[:, 0]
Rather than the scores themselves, however, it would be nice to see the names of the recommended artists. This can be done with the following loop:
In [69]:
# Instead of the top scores, we want to see the names
for i in range(len(data_sims.index)):
    # Sort the scores (excluding the user column) and keep the six best names
    data_recommend.iloc[i, 1:] = data_sims.iloc[i, 1:].astype(float).sort_values(ascending=False).index[:6]
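As an alternative to sorting the whole row, pandas offers `Series.nlargest`, which returns only the highest values. The following sketch applies it to a made-up score matrix; `toy_sims`, `toy_recommend` and `top_k` are assumed names used only for this illustration.

```python
import pandas as pd

# Made-up score matrix: rows are users, columns are artists
toy_sims = pd.DataFrame(
    [[0.0, 0.56, 0.33, 0.10], [0.90, 0.0, 0.0, 0.45]],
    index=["u1", "u2"],
    columns=["artist_a", "artist_b", "artist_c", "artist_d"],
)

top_k = 2
toy_recommend = pd.DataFrame(index=toy_sims.index, columns=range(1, top_k + 1))
for user in toy_sims.index:
    # Keep the names of the top_k highest-scoring artists for this user
    toy_recommend.loc[user] = toy_sims.loc[user].nlargest(top_k).index.to_numpy()

print(toy_recommend)
```

Note that `nlargest` needs a numeric row, so a data frame created as an empty placeholder may first have to be converted with `astype(float)`.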
In [70]:
# Print a sample: the first five rows and columns
print(data_recommend.iloc[:5, :5])